Abstract



We propose and analyze a novel framework for learning sparse representations, based on two statistical techniques:
kernel smoothing and marginal regression. The proposed approach provides a flexible framework for incorporating
feature similarity or temporal information present in data sets, via non-parametric kernel smoothing.

We provide generalization bounds for dictionary learning using smooth sparse coding and show how the sample 
complexity depends on the $L_1$ norm of the kernel function used. Furthermore, we propose using marginal regression
for obtaining sparse codes, which significantly improves the speed and allows one to scale to large dictionary sizes 
easily. We demonstrate the advantages of the proposed approach, both in terms of accuracy and speed by extensive 
experimentation on several real data sets. In addition, we demonstrate how the proposed approach could be used 
for improving semi-supervised sparse coding. 



1 Introduction 



Sparse coding is a popular unsupervised paradigm for learning sparse representations of data samples that are
subsequently used in classification tasks. In standard sparse coding, each data sample is coded independently with respect
to the dictionary. We propose a smooth alternative to traditional sparse coding that incorporates feature similarity,
temporal or other user-specified domain information between the samples, into the coding process.

The idea of smooth sparse coding is motivated by the relevance weighted likelihood principle. Our approach 
constructs a code that is efficient in a smooth sense and as a result leads to improved statistical accuracy over 
traditional sparse coding. The smoothing operation, which could be expressed as non-parametric kernel smoothing, 
provides a flexible framework for incorporating several types of domain information that might be available for the 
user. For example, for an image classification task, one could use: (1) kernels in feature space for encoding similarity
information for images and videos, (2) kernels in time space in case of videos for incorporating temporal relationship,
and (3) kernels on unlabeled image in the semi-supervised learning and transfer learning settings. 

Most sparse coding training algorithms fall under the general category of alternating procedures with a convex lasso 
regression sub-problem. While efficient algorithms for such cases exist [22, 11], their scalability for large dictionaries
remains a challenge. We propose a novel training method for sparse coding based on marginal regression, rather 
than solving the traditional alternating method with lasso sub-problem. Marginal regression corresponds to several 
univariate linear regression followed by a thresholding step to promote sparsity. For large dictionary sizes, this leads to 
a dramatic speedup compared to traditional sparse coding methods (up to two orders of magnitude) without sacrificing 
statistical accuracy. 

We further develop theory that extends the sample complexity result of [20] for dictionary learning using standard
sparse coding to the smooth sparse coding case. We specifically show how the sample complexity depends on the $L_1$
norm of the kernel function used.

Our main contributions are: (1) proposing a framework based on kernel-smoothing for incorporating feature, time
or other similarity information between the samples into sparse coding, (2) providing sample complexity results for 
dictionary learning using smooth sparse coding, (3) proposing an efficient marginal regression training procedure for 
sparse coding, and (4) successful application of the proposed method in various classification tasks. Our contributions 
lead to improved classification accuracy in conjunction with computational speedup of two orders of magnitude. 



2 Related work 

Our approach is related to the local regression method [13, 7]. More recent related work is [15] that uses smoothing
techniques in high-dimensional lasso regression in the context of temporal data. Another recent approach proposed 
by US] achieves code locality by approximating data points using a linear combination of nearby basis points. The 
main difference is that traditional local regression techniques do not involve basis learning. In this work, we propose 
to learn the basis or dictionary along with the regression coefficients locally. 

In contrast to previous sparse coding papers we propose to use marginal regression for learning the regression 
coefficients, which results in a significant computational speedup with no loss of accuracy. Marginal regression is a 
relatively old technique that has recently reemerged as a computationally faster alternative to lasso regression [5] . See 
also for a statistical comparison of lasso regression and marginal regression. 



3 Smooth Sparse Coding 

Notations: The notations x and X correspond to vectors and matrices respectively, in appropriately defined dimensions;
the notation $\|\cdot\|_p$ corresponds to the $L_p$ norm of a vector (we mostly use p = 1, 2 in this paper); the notation
$\|\cdot\|_F$ corresponds to the Frobenius norm of a matrix; the notation $|f|_p$ corresponds to the $L_p$ norm of the function f:
$(\int |f|^p \, d\mu)^{1/p}$; the notation $x_i$, i = 1, ..., n corresponds to the data samples, where we assume that each sample $x_i$ is
a d-dimensional vector. The explanation below uses the $L_1$ norm for sparsity for simplicity. But the method applies more
generally to any structured regularizers, for e.g., [31 IH] - 

The standard sparse coding problem consists of solving the following optimization problem,

$$\min_{D \in \mathbb{R}^{d \times K},\ \beta_i \in \mathbb{R}^{K},\ i=1,\ldots,n} \ \sum_{i=1}^{n} \|x_i - D\beta_i\|_2^2$$

subject to $\|d_j\|_2 \le 1$, $j = 1, \ldots, K$
           $\|\beta_i\|_1 \le \lambda$, $i = 1, \ldots, n$,

where $\beta_i \in \mathbb{R}^K$ corresponds to the encoding of sample $x_i$ with respect to the dictionary $D \in \mathbb{R}^{d \times K}$ and $d_j \in \mathbb{R}^d$
denotes the j-th column of the dictionary matrix D. The dictionary is typically over-complete, implying that K > d.

Object recognition is a common sparse coding application where $x_i$ corresponds to a set of features obtained from
a collection of image patches, for example SIFT features P3]. The dictionary D corresponds to an alternative coding 
scheme that is higher dimensional than the original feature representation. The $L_1$ constraint promotes sparsity of the
new encoding with respect to D. Thus, every sample is now encoded as a sparse vector that is of higher dimensionality 
than the original representation. 

In some cases the data exhibits a structure that is not captured by the above sparse coding setting. For example, 
SIFT features corresponding to samples from the same class are presumably closer to each other compared to SIFT 
features from other classes. Similarly in video, neighboring frames are presumably more related to each other than 
frames that are farther apart. In this paper we propose a mechanism to incorporate such feature similarity and 
temporal information into sparse coding, leading to a sparse representation with an improved statistical accuracy (for 
example as measured by classification accuracy). 

We consider the following smooth version of the sparse coding problem above:

$$\min_{D \in \mathbb{R}^{d \times K},\ \beta_i \in \mathbb{R}^{K},\ i=1,\ldots,n} \ \sum_{i=1}^{n} \sum_{j=1}^{n} w(x_j, x_i)\, \|x_j - D\beta_i\|_2^2$$    (1)

subject to $\|d_j\|_2 \le 1$, $j = 1, \ldots, K$    (2)
           $\|\beta_i\|_1 \le \lambda$, $i = 1, \ldots, n$,    (3)

where $\sum_{j=1}^{n} w(x_j, x_i) = 1$ for all i. It is convenient to define the weight function through a smoothing kernel

$$w(x_j, x_i) = \frac{1}{h_1} K_1\!\left(\frac{\rho(x_j, x_i)}{h_1}\right),$$

where $\rho(\cdot, \cdot)$ is a distance function that captures the feature similarity, $h_1$ is the bandwidth, and $K_1$ is a smoothing
kernel. Traditional sparse coding minimizes the reconstruction error of the encoded samples. Smooth sparse coding,
on the other hand, minimizes the reconstruction error of encoded samples with respect to their neighbors (weighted by the
amount of similarity).

The smooth sparse coding setting leads to codes that represent a neighborhood rather than an individual sample 
and that have lower mean square reconstruction error (with respect to a given dictionary), due to lower estimation 
variance (see for example the standard theory of smoothed empirical process [1]). 

3.1 The choice of smoothing kernel 

There are several possible ways to determine the weight function w. One common choice for the kernel function is 
the Gaussian kernel whose bandwidth is selected using cross-validation. Other common choices for the kernel are 
the triangular, uniform, and tricube kernels. The bandwidth may be fixed throughout the input space, or may vary 
in order to take advantage of non-uniform samples. We use in our experiment the tricube kernel with a constant 
bandwidth. 
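
As an illustration of the kernel choice described above, the following is a minimal sketch (not the authors' implementation) of how a normalized weight matrix could be computed with a tricube kernel and a constant bandwidth. The function names `tricube` and `kernel_weights`, the Euclidean distance, and the row-per-sample layout are assumptions made for this example; the normalization enforces that the weights for each i sum to one.

```python
import numpy as np

def tricube(u):
    """Tricube kernel (up to a constant): (1 - |u|^3)^3 for |u| < 1, zero elsewhere."""
    u = np.abs(u)
    return np.where(u < 1.0, (1.0 - u**3) ** 3, 0.0)

def kernel_weights(X, h):
    """Return W with W[j, i] = w(x_j, x_i) for samples stored as rows of X.

    Assumes a Euclidean distance and a constant bandwidth h. Each column is
    normalized so that sum_j w(x_j, x_i) = 1; constant factors such as 1/h
    cancel in this normalization.
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = tricube(dist / h)
    # The diagonal equals tricube(0) = 1, so every column sum is at least 1.
    return W / W.sum(axis=0, keepdims=True)
```

Samples farther than h from x_i receive zero weight, so the bandwidth directly controls how local the smoothing is.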

The distance function $\rho(\cdot, \cdot)$ may be one of the standard distance functions (for example based on the $L_p$ norm).
Alternatively, $\rho(\cdot, \cdot)$ may be expressed by domain experts, learned from data before the sparse coding training, or
learned jointly with the dictionary and codes during the sparse coding training.

3.2 Spatio-temporal smoothing

In spatio-temporal applications we can extend the kernel to include also a term reflecting the distance between the
corresponding time or space

$$w(x_j, x_i) = \frac{1}{h_1} K_1\!\left(\frac{\rho(x_j, x_i)}{h_1}\right) \frac{1}{h_2} K_2\!\left(\frac{j - i}{h_2}\right).$$

Above, $K_2$ is a univariate symmetric kernel with bandwidth parameter $h_2$. One example is video sequences, where the
kernel above combines similarity of the frame features and the time-stamp.
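
A minimal sketch of such a combined weight for video frames is given below, assuming Gaussian kernels for both the feature term and the time term, a Euclidean feature distance, and a product of the two terms. The name `spatiotemporal_weights` and its arguments are hypothetical and chosen only for this example.

```python
import numpy as np

def gaussian(u):
    """Unnormalized Gaussian kernel; constants cancel after normalization."""
    return np.exp(-0.5 * u**2)

def spatiotemporal_weights(X, times, h1, h2):
    """Return W with W[j, i] combining feature similarity and temporal proximity.

    X: (n, d) frame features stored as rows; times: (n,) time stamps.
    h1, h2: bandwidths for the feature and time kernels (assumed constant).
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dt = times[:, None] - times[None, :]
    W = gaussian(dist / h1) * gaussian(dt / h2)
    # Normalize each column so the weights for sample i sum to one.
    return W / W.sum(axis=0, keepdims=True)
```

Dropping the first factor gives a purely temporal weight.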

Alternatively, the weight function can feature only the temporal component and omit the first term containing the 
distance function between the feature representation. A related approach for that situation, is based on the Fused lasso 
which penalizes the absolute difference between codes for neighboring points. The main drawback of that approach is 
that one needs to fit all the data points simultaneously whereas in smooth sparse coding, the coefficient learning step 
decomposes as n separate problems which provides a computational advantage (see Section 9.1.5 for more details).
Also, while fused Lasso penalty is suitable for time-series data to capture relatedness between neighboring frames, it 
may not be immediately suitable for other situations that the proposed smooth sparse coding method could handle. 

4 Marginal Regression for Smooth Sparse Coding 

A standard algorithm for sparse coding is the alternating bi-convex minimization procedure, where one alternates 
between (i) optimizing for codes (with a fixed dictionary) and (ii) optimizing for dictionary (with fixed codes). Note 
that step (i) corresponds to regression with $L_1$ constraints and step (ii) corresponds to least squares with $L_2$ constraints.
In this section we show how marginal regression could be used to obtain better codes faster (step (i)). In order to do 
so, we first give a brief description of the marginal regression procedure. 

Marginal Regression: Consider a regression model $y = X\beta + z$ where $y \in \mathbb{R}^n$, $\beta \in \mathbb{R}^p$, $X \in \mathbb{R}^{n \times p}$ with $L_2$
normalized columns (denoted by $x_j$), and z is the noise vector. Marginal regression proceeds as follows:

• Calculate the least squares solution

$$\hat{\alpha}^{(j)} = x_j^\top y.$$

• Threshold the least-square coefficients

$$\hat{\beta}^{(j)} = \hat{\alpha}^{(j)} \, 1_{\{|\hat{\alpha}^{(j)}| > t\}}, \quad j = 1, \ldots, p.$$

Marginal regression requires just $O(np)$ operations compared to $O(p^3 + np^2)$, the typical complexity of lasso
algorithms. When p is much larger than n, marginal regression provides a speedup of up to two orders of magnitude over
lasso-based formulations. Note that in sparse coding, the above speedup occurs for each iteration of the outer loop, thus
enabling sparse coding for significantly larger dictionary sizes. Recent studies have suggested that marginal regression
is a viable alternative to lasso given its computational advantage. A comparison of the statistical properties
of marginal regression and lasso is available in [5, 6].
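
The two steps of marginal regression can be sketched in a few lines. This is an illustrative example only; the function name `marginal_regression` and the threshold argument `t` are assumptions for the sketch, and the columns of X are normalized inside the function.

```python
import numpy as np

def marginal_regression(X, y, t):
    """Marginal regression: p univariate least squares fits, then hard thresholding.

    X: (n, p) design matrix; y: (n,) response; t: threshold on |alpha_j|.
    """
    # L2-normalize the columns so each univariate fit is alpha_j = x_j^T y.
    X = X / np.linalg.norm(X, axis=0, keepdims=True)
    alpha = X.T @ y                       # O(np) operations
    # Keep only the coefficients whose magnitude exceeds the threshold.
    return np.where(np.abs(alpha) > t, alpha, 0.0)
```

The only matrix operation is a single product with X, which is where the O(np) cost comes from.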

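Applying marginal regression to smooth sparse coding, we obtain the following scheme. The marginal least squares
coefficients are

$$\hat{\alpha}_i^{(k)} = \sum_{j=1}^{n} \frac{w(x_j, x_i)}{\|d_k\|_2} \, d_k^\top x_j.$$

We sort these coefficients in terms of their absolute values, and select the top s coefficients whose $L_1$ norm is bounded
by $\lambda$:

$$\hat{\beta}_i^{(k)} = \begin{cases} \hat{\alpha}_i^{(k)} & k \in S \\ 0 & k \notin S, \end{cases}$$

where

$$S = \Big\{1, \ldots, s \;:\; s \le d,\ \sum_{k=1}^{s} |\hat{\alpha}_i^{(k)}| \le \lambda \Big\}.$$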

We select the thresholding parameter using cross validation in each of the sparse coding iterations. Note that the 
same approach could be used with structured regularizers too, for example [HIE]. 
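
The coding step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: samples are stored as rows of `X`, `W[j, i]` holds w(x_j, x_i), the budget `lam` is passed in directly rather than chosen by cross validation, and the name `smooth_marginal_codes` is hypothetical.

```python
import numpy as np

def smooth_marginal_codes(X, D, W, lam):
    """Smooth sparse codes via marginal regression.

    X: (n, d) samples as rows; D: (d, K) dictionary; W: (n, n) weights with
    W[j, i] = w(x_j, x_i); lam: L1 budget. Returns B of shape (K, n).
    """
    d = X.shape[1]
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    # alpha[k, i] = sum_j w(x_j, x_i) * d_k^T x_j / ||d_k||_2
    alpha = Dn.T @ X.T @ W
    B = np.zeros_like(alpha)
    for i in range(alpha.shape[1]):
        order = np.argsort(-np.abs(alpha[:, i]))          # largest magnitude first
        csum = np.cumsum(np.abs(alpha[order, i]))
        # Largest s such that the top-s coefficients have L1 norm <= lam.
        s = min(int(np.searchsorted(csum, lam, side="right")), d)
        keep = order[:s]
        B[keep, i] = alpha[keep, i]
    return B
```

Each column of B is computed independently of the others, which reflects the decomposition of the coefficient learning step into n separate problems.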

Marginal regression works well when there is minimal correlation between the different dictionary atoms. In the
linear regression setting, marginal regression performs much better with orthogonal data [6]. In the context of sparse
coding, this corresponds to having uncorrelated or incoherent dictionaries [19]. One way to measure such incoherence
is using the babel function, which bounds the maximum inner product between two different columns $d_i$, $d_j$:

$$\mu_s(D) = \max_{i \in \{1, \ldots, K\}} \ \max_{\Lambda \subset \{1, \ldots, K\} \setminus \{i\},\ |\Lambda| = s} \ \sum_{j \in \Lambda} |d_j^\top d_i|.$$

An alternative, which leads to easier computation, is enforcing a bound on $\|D^\top D - I_{K \times K}\|_F^2$ when optimizing over
the dictionary matrix D:

$$\hat{D} = \arg\min_{D \in \mathcal{D}} \sum_{i=1}^{n} \|x_i - D\hat{\beta}_i\|_2^2, \quad \text{where}$$

$$\mathcal{D} = \{D \in \mathbb{R}^{d \times K} : \|d_j\|_2 \le 1,\ \|D^\top D - I\|_F^2 \le \gamma\}.$$
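
Both incoherence measures are cheap to evaluate for a given dictionary. The sketch below assumes unit-norm columns; the names `babel` and `incoherence` are hypothetical and used only for illustration.

```python
import numpy as np

def babel(D, s):
    """Babel function mu_s(D): for each atom, sum the s largest absolute inner
    products with the other atoms, then take the maximum over atoms."""
    G = np.abs(D.T @ D).astype(float)
    np.fill_diagonal(G, 0.0)          # exclude the atom itself
    G.sort(axis=1)                    # ascending order within each row
    return G[:, -s:].sum(axis=1).max()

def incoherence(D):
    """Relaxed measure ||D^T D - I||_F^2 used in the dictionary constraint set."""
    K = D.shape[1]
    return np.linalg.norm(D.T @ D - np.eye(K), "fro") ** 2
```

For an orthonormal dictionary both quantities are zero, which is the regime where marginal regression is expected to behave best.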

We use the method of optimal directions update [17] to solve the above optimization problem. Specifically, representing
the constraints using the Lagrangian and setting the derivative with respect to D to zero, we get the following
update rule

$$D^{(t+1)} = \left(X \hat{B}_{t+1}^\top + 2(\kappa + \eta) D_t\right) \left(\hat{B}_{t+1} \hat{B}_{t+1}^\top + 2\kappa D_t^\top D_t + 2\eta \, \mathrm{diag}(D_t^\top D_t)\right)^{-1}.$$

Above, $\hat{B}_t = [\hat{\beta}_1(t), \ldots, \hat{\beta}_n(t)]$ is the matrix of data codes obtained in iteration t, $X \in \mathbb{R}^{d \times n}$ is the data in
matrix format, $\kappa$ is a regularization parameter corresponding to the incoherence constraints, and $\eta$ is a regularization
parameter corresponding to the normalization constraints. Note that if $\kappa = \eta = 0$, the update reduces to the standard
least squares update with no constraints.
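
A sketch of this dictionary update is shown below. It assumes the regularized matrix is applied from the right, so that the dimensions agree and the case with both regularization parameters equal to zero reduces to the least squares update $X B^\top (B B^\top)^{-1}$. The function name `dictionary_update` and the argument names are hypothetical.

```python
import numpy as np

def dictionary_update(X, B, D_prev, kappa, eta):
    """One regularized method-of-optimal-directions style update.

    X: (d, n) data; B: (K, n) codes; D_prev: (d, K) previous dictionary.
    kappa, eta: weights of the incoherence and normalization terms.
    """
    G = D_prev.T @ D_prev
    lhs = B @ B.T + 2.0 * kappa * G + 2.0 * eta * np.diag(np.diag(G))   # (K, K)
    rhs = X @ B.T + 2.0 * (kappa + eta) * D_prev                        # (d, K)
    # We need D with D @ lhs = rhs. Since lhs is symmetric, solve
    # lhs @ D^T = rhs^T instead of forming the inverse explicitly.
    return np.linalg.solve(lhs, rhs.T).T
```

Using a linear solve rather than an explicit inverse is a numerical convenience and does not change the update.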

A sequence of such updates corresponding to step (i) and step (ii) converges to a stationary point of the optimization 
problem (this can be shown using Zangwill's theorem [27]). But no provable algorithm that converges to the global
minimum of the smooth sparse coding (or standard sparse coding) exists yet. Nevertheless, the main idea of this
section is to speed-up the existing alternating bi-convex minimization procedure for obtaining sparse representations,
by using marginal regression.

Algorithm 1 Smooth Sparse Coding via Marginal Regression

Input: Data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ and kernel/similarity measure $K_1$ and $d_1$.

Precompute: Compute the weight matrix w(i, j) using the kernel/similarity measure.

Initialize: Set the dictionary at time zero to be $D_0$.

Algorithm:

repeat

    Step (i): For all i = 1, ..., n, solve marginal regression:

        $$\hat{\alpha}_i^{(k)} = \sum_{j=1}^{n} \frac{w(x_j, x_i)}{\|d_k\|_2} \, d_k^\top x_j$$

        $$\hat{\beta}_i^{(k)} = \begin{cases} \hat{\alpha}_i^{(k)} & k \in S \\ 0 & k \notin S \end{cases}$$

        $$S = \Big\{1, \ldots, s \;:\; s \le d,\ \sum_{k=1}^{s} |\hat{\alpha}_i^{(k)}| \le \lambda \Big\}.$$

    Step (ii): Update the dictionary based on codes from previous step.

        $$\hat{D}_t = \arg\min_{D \in \mathcal{D}} \sum_{i=1}^{n} \|x_i - D\hat{\beta}_i(t)\|_2^2, \quad \text{where}$$

        $$\mathcal{D} = \{D \in \mathbb{R}^{d \times K} : \|d_j\|_2 \le 1,\ \|D^\top D - I\|_F^2 \le \gamma\}$$

until convergence

Output: Return the learned codes and dictionary.



5 Sample Complexity of Smooth sparse coding 

In this section, we analyze the sample complexity of the proposed smooth sparse coding framework. Specifically,
since there does not exist a provable algorithm that converges to the global minimum of the optimization problem in
Equation (1), we provide uniform convergence bounds over the dictionary space and thereby prove a sample complexity
result for dictionary learning under the smooth sparse coding setting. We leverage the analysis for dictionary learning in
the standard sparse coding setting by [20] and extend it to the smooth sparse coding setting. The main difficulty for
the smooth sparse coding setting is obtaining a covering number bound for an appropriately defined class of functions
(see Theorem 5.1 for more details).

We begin by re-representing the smooth sparse coding problem in a convenient format for analysis. Let $x_1, \ldots, x_n$
be independent random variables with a common probability measure $\mathbb{P}$ with a density p. We denote by $\mathbb{P}_n$ the
empirical measure over the n samples, and the kernel density estimate of p is defined by

$$\hat{p}_{n,h}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{\|x - x_i\|_2}{h}\right).$$

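Let $K_{h_1}(\cdot) = \frac{1}{h_1} K_1\!\left(\frac{\cdot}{h_1}\right)$. With the above notations, the reconstruction error at the point x is given by

$$r_\lambda(x) = \int \min_{\beta \in S_\lambda} \|x' - D\beta\|_2 \, K_{h_1}(\rho(x, x')) \, d\mathbb{P}_n(x')$$

where

$$S_\lambda = \{\beta : \|\beta\|_1 \le \lambda\}.$$

The empirical reconstruction error is

$$E_{\mathbb{P}_n}(r_\lambda) = \iint \min_{\beta \in S_\lambda} \|x' - D\beta\|_2 \, K_{h_1}(\rho(x, x')) \, d\mathbb{P}_n(x') \, dx$$

and its population version is

$$E_{\mathbb{P}}(r_\lambda) = \iint \min_{\beta \in S_\lambda} \|x' - D\beta\|_2 \, K_{h_1}(\rho(x, x')) \, d\mathbb{P}(x') \, dx.$$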

Our goal is to show that the sample reconstruction error is close to the true reconstruction error. Specifically, to
show $E_{\mathbb{P}}(r_\lambda) \le (1 + \kappa) E_{\mathbb{P}_n}(r_\lambda) + \epsilon$ where $\epsilon, \kappa \ge 0$, we bound the covering number of the class of functions corresponding
to the reconstruction error. We assume a dictionary of bounded babel function, which holds as a result of the relaxed
orthogonality constraint used in Algorithm 1 (see also [17]). We define the set of $r_\lambda$ functions with respect to the
dictionary D (assuming data lies in the unit d-dimensional ball $\mathbb{S}^{d-1}$) by

$$F_\lambda = \{r_\lambda : \mathbb{S}^{d-1} \to \mathbb{R} \;:\; D \in \mathbb{R}^{d \times K},\ \|d_i\|_2 \le 1,\ \mu_s(D) \le \gamma\}.$$

The following theorem bounds the covering number of the above function class. 

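Theorem 5.1. For every $\epsilon > 0$, the metric space $(F_\lambda, |\cdot|_\infty)$ has a subset of cardinality at most $\left(\frac{4\lambda |K_{h_1}(\cdot)|_1}{\epsilon (1 - \gamma)}\right)^{dK}$, such
that every element from the class is at a distance of at most $\epsilon$ from the subset, where $|K_{h_1}(\cdot)|_1 = \int |K_{h_1}(x)| \, d\mathbb{P}$.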

Proof. Let $F'_\lambda = \{r'_\lambda : \mathbb{S}^{d-1} \to \mathbb{R} : D \in \mathbb{R}^{d \times K}, \|d_i\|_2 \le 1\}$, where $r'_\lambda(x) = \min_{\beta \in S_\lambda} \|D\beta - x\|$. With this definition we
note that $F_\lambda$ is just $F'_\lambda$ convolved with the kernel $K_{h_1}(\cdot)$. By Young's inequality [4] we have,

$$|K_{h_1} * (s_1 - s_2)|_p \le |K_{h_1}|_1 \, |s_1 - s_2|_p, \quad 1 \le p \le \infty$$

for any $L_p$ integrable functions $s_1$ and $s_2$. Using this fact, we see that the convolution mapping between the metric spaces $F'_\lambda$
and $F_\lambda$ converts $\epsilon / |K_{h_1}(\cdot)|_1$ covers into $\epsilon$ covers. From [20], we have that the class $F'_\lambda$ has $\epsilon / |K_{h_1}(\cdot)|_1$ covers of size at most
$\left(\frac{4\lambda |K_{h_1}(\cdot)|_1}{\epsilon (1 - \gamma)}\right)^{dK}$. This proves the statement of the theorem. □

This leads to the following generalization bound for the smooth sparse coding. 

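Theorem 5.2. Let $\gamma < 1$, $\lambda > e/4$ with distribution $\mathbb{P}$ on $\mathbb{S}^{d-1}$. Then with probability at least $1 - e^{-t}$ over the n
samples drawn according to $\mathbb{P}$, for all the D with unit length columns and $\mu_s(D) \le \gamma$, we have:

$$E_{\mathbb{P}}(r_\lambda) \le E_{\mathbb{P}_n}(r_\lambda) + \sqrt{\frac{dK \ln\!\left(\frac{4\sqrt{n}\,\lambda\,|K_{h_1}(\cdot)|_1}{1-\gamma}\right)}{2n}} + \sqrt{\frac{t}{2n}} + \sqrt{\frac{4}{n}}.$$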



The above theorem follows from the previous covering number bound and the following lemma for generalization
bound that is based on the result in [20] concerning $|\cdot|_\infty$ covering numbers.

Lemma 1. Let Q be a function class of [0, B] functions with covering number $(C/\epsilon)^d > e/B^2$ under the $|\cdot|_\infty$ norm. Then
for every t > 0 with probability at least $1 - e^{-t}$, for all $f \in Q$, we have:

$$E f \le E_n f + B \left( \sqrt{\frac{d \ln(C\sqrt{n})}{2n}} + \sqrt{\frac{t}{2n}} \right) + \sqrt{\frac{4}{n}}.$$

The above theorem shows that the generalization error scales as $O(n^{-1/2})$ (assuming the other problem parameters are
fixed). In the case of $\kappa > 0$, it is possible to obtain faster rates of $O(n^{-1})$ for smooth sparse coding, similar to derivations
in [20]. The following theorem gives the precise statement.

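Theorem 5.3. Let $\gamma < 1$, $\lambda > e/4$, $dK > 20$ and $n \ge 5000$. Then with probability at least $1 - e^{-t}$, we have for all D
with unit length and $\mu_s(D) \le \gamma$,

$$E_{\mathbb{P}}(r_\lambda) \le 1.1 \, E_{\mathbb{P}_n}(r_\lambda) + 9 \, \frac{dK \ln\!\left(\frac{4 n \lambda |K_{h_1}(\cdot)|_1}{1-\gamma}\right) + t}{n}.$$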

The above theorem follows from the covering number bound above and Proposition 22 from [20]. The definition
of $r_\lambda(x)$ differs from (1) by a square term, but it could easily be incorporated into the above bounds resulting in an
additive factor of 2 inside the logarithm term.

6 Experiments



We demonstrate the advantage of the proposed approach both in terms of speed-up and accuracy, over standard sparse
coding. A detailed description of all real-world data sets used in the experiments is given in the appendix.

6.1 Speed comparison 

We conducted synthetic experiments to examine the speed-up provided by sparse coding with marginal regression.
The data was generated from a 100-dimensional mixture of two Gaussian distributions that satisfies $\|\mu_1 - \mu_2\|_2 = 3$
(with identity covariance matrices). The dictionary size was fixed at 1024.

We compare the proposed smooth sparse coding algorithm and standard sparse coding, with lasso [11] and marginal
regression updates respectively, using a relative reconstruction error $\|X - DB\|_F / \|X\|_F$ convergence criterion. We
experimented with different values of the relative reconstruction error (less than 10%) and report the average time.
From Table 1 we see that smooth sparse coding with marginal regression takes significantly less time to achieve a
fixed reconstruction error. This is due to the fact that it takes advantage of the spatial structure and uses marginal
regression updates. It is worth mentioning that standard sparse coding with marginal regression updates performs
faster compared to the two methods that use lasso updates, as expected (but does not take into account the
spatial structure).
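
The synthetic setup and the convergence criterion can be sketched as follows. Details not stated in the text are assumptions of this example: equal mixture weights, a mean offset placed along the first coordinate, and the hypothetical names `two_gaussian_mixture` and `relative_error`.

```python
import numpy as np

def two_gaussian_mixture(n, d=100, separation=3.0, seed=0):
    """Sample n points from a two-component Gaussian mixture with identity
    covariances whose means are `separation` apart in L2 norm."""
    rng = np.random.default_rng(seed)
    mu1 = np.zeros(d)
    mu2 = np.zeros(d)
    mu2[0] = separation                       # ||mu1 - mu2||_2 = separation
    labels = rng.integers(0, 2, size=n)       # assumed equal mixture weights
    means = np.where(labels[:, None] == 0, mu1, mu2)
    return means + rng.standard_normal((n, d)), labels

def relative_error(X, D, B):
    """Relative reconstruction error ||X - D B||_F / ||X||_F.

    X: (d, n) data; D: (d, K) dictionary; B: (K, n) codes.
    """
    return np.linalg.norm(X - D @ B, "fro") / np.linalg.norm(X, "fro")
```

An alternating procedure would stop once `relative_error` falls below the chosen tolerance, for example a value under 0.10.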



Method        time (sec)
SC+LASSO      560.4 ± 13
SC+MR         250.6 ± 18
SSC+LASSO     540.2 ± 12
SSC+MR        186.4 ± 10

Table 1: Time comparison of coefficient learning in SC and SSC with either Lasso or Marginal regression updates.
The dictionary update step was the same for all methods.

6.2 Experiments with Kernel in Feature space 

We conducted several experiments demonstrating the advantage of the proposed coding scheme in different settings. 
Concentrating on face and object recognition from static images, we evaluated the performance of the proposed 
approach along with standard sparse coding and LLC [26], another method for obtaining sparse features based on 
locality. Also, we performed experiments on activity recognition from videos based on both space and time based 
kernels. As mentioned before all results are reported using tricube kernel. 

6.2.1 Image classification 

We conducted image classification experiments on CMU-multipie, 15 Scene and Caltech-101 data sets. Following [53],
we used the following approach for generating sparse image representation: we densely sampled 16 × 16 patches
from images at the pixel level on a grid with step size 8 pixels, computed SIFT features, and then computed the
corresponding sparse codes over a 1024-size dictionary. We used max pooling to get the final representation of the
image based on the codes for the patches. The process was repeated with different randomly selected training and
testing images and we report the average per-class recognition rates (together with its standard deviation estimate)
based on one-vs-all SVM classification. We used cross validation to select the regularization and bandwidth parameters.
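
The pooling step that turns patch-level codes into one image-level feature can be illustrated as below. The sketch assumes pooling over absolute code values, a common convention that the text does not specify, and the name `max_pool_codes` is hypothetical.

```python
import numpy as np

def max_pool_codes(patch_codes):
    """Max pooling over patches.

    patch_codes: (num_patches, K) sparse codes of one image.
    Returns a (K,) vector holding, for each dictionary atom, the largest
    absolute response over all patches (absolute values are an assumption).
    """
    return np.max(np.abs(patch_codes), axis=0)
```

The pooled vector has the dictionary size as its length regardless of how many patches the image contains, which is what allows a standard SVM to be trained on top of it.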

As Table 2 indicates, our smooth sparse coding algorithm resulted in significantly higher classification accuracy
than standard sparse coding and LLC. In fact, the reported performance is better than previously reported results using
unsupervised sparse coding techniques [24].

Dictionary size: In order to demonstrate the scalability of the proposed method with respect to dictionary
size, we report classification accuracy with increasing dictionary sizes using smooth sparse coding. The main advantage
of the proposed marginal regression training method is that one could easily run experiments with larger dictionary
sizes, which typically takes a significantly longer time for other algorithms. For both the Caltech-101 and 15-scene
data sets, classification accuracy increases significantly with increasing dictionary sizes as seen in Table 3.

        CMU-multipie     15 scene         Caltech-101
SC      92.70 ± 1.21     80.28 ± 2.12     73.20 ± 1.14
LLC     93.70 ± 2.22     82.28 ± 1.98     74.82 ± 1.65
SSC     94.14 ± 2.01     84.10 ± 1.87     76.24 ± 2.15

Table 2: Test set accuracy on the CMU-multipie (left), 15 scene (middle) and Caltech-101 (right) data sets. The
performance of the smooth sparse coding approach is better than the standard sparse coding and LLC in all cases.



Dictionary size     15 scene         Caltech-101
1024                84.10 ± 1.87     76.24 ± 2.15
2048                87.43 ± 1.55     78.33 ± 1.43
4096                89.53 ± 2.00     79.11 ± 0.87

Table 3: Effect of dictionary size on classification accuracy using smooth sparse coding and marginal regression on
15 scene and Caltech-101 data sets.

6.2.2 Action recognition: 

We further conducted an experiment on activity recognition from videos with the KTH action and YouTube data sets (see
Appendix). Similar to the static image case, we follow the standard approach for generating sparse representations
for videos as in [21]. We densely sample 16 × 16 × 10 blocks from the video and extract HoG-3d [10] features from the
sampled blocks. We then use smooth sparse coding and max-pooling to generate the video representation (dictionary
size was fixed at 1024 and cross-validation was used to select the regularization and bandwidth parameters). Previous
approaches include sparse coding, vector quantization, and k-means on top of the HoG-3d feature set (see [21] for a
comprehensive evaluation). As indicated by Table 4, smooth sparse coding results in higher classification accuracy
than previously reported state-of-the-art and standard sparse coding on both datasets (see [21, 12] for a description
of the alternative techniques).

6.2.3 Discriminatory power 

In this section, we describe another experiment that contrasts the codes obtained by sparse coding and smooth sparse 
coding in the context of a subsequent classification task. As in [25], we first compute the codes in both cases based
on patches and combine them with max-pooling to obtain the image level representation. We then compute the Fisher
discriminant score (ratio of between-class variance to within-class variance) for each dimension as a measure of the
discrimination power realized by the representations.
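
A per-dimension Fisher score can be computed as in the sketch below. It assumes the usual convention in which the score is the between-class variance divided by the within-class variance, so that larger values indicate more discriminative dimensions; the name `fisher_scores` is hypothetical.

```python
import numpy as np

def fisher_scores(F, y):
    """Per-dimension Fisher discriminant score.

    F: (n, K) image-level features; y: (n,) integer class labels.
    Returns a (K,) array of between-class over within-class scatter.
    Dimensions with zero within-class scatter are reported as NaN.
    """
    overall = F.mean(axis=0)
    between = np.zeros(F.shape[1])
    within = np.zeros(F.shape[1])
    for c in np.unique(y):
        Fc = F[y == c]
        mean_c = Fc.mean(axis=0)
        between += len(Fc) * (mean_c - overall) ** 2
        within += ((Fc - mean_c) ** 2).sum(axis=0)
    out = np.full_like(between, np.nan)
    return np.divide(between, within, out=out, where=within > 0)
```

Dividing the scores of two representations dimension by dimension gives a ratio in which values above 1 favor the representation in the numerator.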

Figure 1 graphs a histogram of the ratio of smooth sparse coding Fisher score over standard sparse coding Fisher
score $R(d) = F_1(d) / F_2(d)$ for the 15-scene dataset (left) and the YouTube dataset (right). Both histograms demonstrate the
improved discriminatory power of smooth sparse coding over regular sparse coding.

6.3 Experiments using Temporal Smoothing 

In this section we describe an experiment conducted using the temporal smoothing kernel on the Youtube persons 
dataset. We extracted SIFT descriptors for every 16 × 16 patch sampled on a grid of step size 8 and used smooth
sparse coding with time kernel to learn the codes and max pooling to get the final video representation. We avoided 
pre-processing steps such as face extraction or face tracking. Note that in the previous action recognition video 
experiment, video blocks were densely sampled and used for extracting HoG-3d features. In this experiment, on the 
other hand, we extracted SIFT features from individual frames and used the time kernels to incorporate the temporal 
information into the sparse coding process. 

For this case, we also compared to the more standard fused-lasso based approach [15]. Note that in the fused lasso based
approach, in addition to the standard $L_1$ penalty, an additional $L_1$ penalty on the difference between the neighboring
frames for each dimension is used. This tries to enforce the assumption that in a video sequence, neighboring frames
are more related to one another as compared to frames that are farther apart.
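
The structure of the fused lasso penalty can be made concrete with a small sketch. The two weights `lam1` and `lam2` and the name `fused_lasso_penalty` are assumptions for this illustration; the text does not specify how the two terms are weighted.

```python
import numpy as np

def fused_lasso_penalty(B, lam1, lam2):
    """Fused lasso penalty on a sequence of codes.

    B: (K, T) codes for T consecutive frames, one column per frame.
    lam1 weights the usual L1 sparsity term; lam2 weights the L1 norm of
    the differences between codes of neighboring frames, per dimension.
    """
    sparsity = np.abs(B).sum()
    fusion = np.abs(np.diff(B, axis=1)).sum()
    return lam1 * sparsity + lam2 * fusion
```

Because the second term couples neighboring columns of B, the codes of all frames have to be fitted jointly, in contrast to the per-sample decomposition of smooth sparse coding.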

Table 5 shows that smooth sparse coding achieved higher accuracy than fused lasso and standard sparse coding.
Smooth sparse coding has comparable accuracy on person recognition tasks to other methods that use face-tracking,

Data set     Cited method     SC         SSC
KTH          92.10 [21]       92.423     93.549
YouTube      71.2 [12]        72.640     74.974

Table 4: Action recognition (accuracy) for cited method (left), Hog3d+SC (middle) and Hog3d+SSC (right): KTH
data set (top), YouTube action dataset (bottom).




Figure 1: Comparison between the histograms of Fisher discriminant score realized by sparse coding and smooth sparse 
coding. The images represent the histogram of the ratio of smooth sparse coding Fisher score over standard sparse 
coding Fisher score (left: image data set; right: video). A value greater than 1 implies that smooth sparse coding is 
more discriminatory. 



for example [9]. Another advantage of smooth sparse coding is that it is significantly faster than sparse coding and
the fused lasso.



7 Semi-supervised smooth sparse coding 

One of the primary difficulties in some image classification tasks is the lack of availability of labeled data and in some
cases, both labeled and unlabeled data (for particular domains). This motivated semi-supervised learning and transfer
learning without labels [16] respectively. The motivation for such approaches is that data from a related domain might
have some visual patterns that might be similar to the problem at hand. Hence, learning a high-level dictionary based
on data from a different domain aids the classification task of interest.

We propose that the smooth sparse coding approach might be useful in this setting. The motivation is as follows: 
in semi-supervised learning, typically not all samples from a different data set might be useful for the task at hand. Using
smooth sparse coding, one can weigh the useful points more than the other points (the weights being calculated based
on feature/time similarity kernel) to obtain better dictionaries and sparse representations. Other approaches to handle
a lower number of labeled samples include collaborative modeling or multi-task approaches which impose a shared
structure on the codes for several tasks and use data from all the tasks simultaneously, for example group sparse 
coding PJ. The proposed approach provides an alternative when such collaborative modeling assumptions do not hold, 
by using relevant unlabeled data samples that might help the task at hand via appropriate weighting. 

We now describe an experiment that examines the proposed smoothed sparse coding approach in the context of 
semi-supervised dictionary learning. We use data from both the CMU Multi-PIE dataset (session 1) and faces-on-tv dataset
(treated as frames) to learn a dictionary using a feature similarity kernel. We follow the same procedure described in 
the previous experiments to construct the dictionary. In the test stage we use the obtained dictionary for coding data 
from sessions 2, 3, 4 of CMU-multipie data set, using smooth sparse coding. Note that semi-supervision was used only 
in the dictionary learning stage (the classification stage used supervised SVM). 

Table 6 shows the test set error rate and compares it to standard sparse coding and LLC [26]. Smooth sparse
coding achieves significantly lower test error rate than the two alternative techniques. We conclude that the smoothing
approach described in this paper may be useful in cases where there is a small set of labeled data, such as semi-supervised
learning and transfer learning.

Method       Fused Lasso     SC        SSC-tricube
Accuracy     68.59           65.53     69.01

Table 5: Linear SVM accuracy for person recognition task from YouTube face video dataset.



Method         SC        LLC       SSC-tricube
Test error     6.345     6.003     5.135

Table 6: Semi-supervised learning test set error: Dictionary learned from both CMU Multi-PIE and faces-on-tv data
sets using feature similarity kernel, used to construct sparse codes for CMU Multi-PIE data set.



8 Discussion and Future work 

We proposed a simple framework for incorporating similarity in feature space and space or time into sparse coding. 
The codes obtained by smooth sparse coding are significantly more discriminative than those obtained by traditional sparse 
coding, and lead to substantially improved classification accuracy as measured on several different image and video 
classification tasks. 

We also propose modifying sparse coding by replacing the lasso optimization stage with marginal 
regression and adding a constraint to enforce incoherent dictionaries. The resulting algorithm is significantly faster 
(a speedup of about two orders of magnitude over standard sparse coding). This facilitates scaling up the sparse coding 
framework to large dictionaries, a regime that is usually restricted by intractable computation. We also explore 
promising extensions to temporal smoothing, semi-supervised learning and transfer learning. We provide bounds on 
the covering numbers that lead to generalization bounds for the smooth sparse coding dictionary learning problem. 
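
To give a sense of why marginal regression is cheap compared with a lasso solve, here is a minimal sketch: the code for a sample comes from a single pass of inner products with the dictionary atoms, followed by a selection step. Keeping the `k` largest-magnitude coefficients is an assumption of this example (the thresholding rule is not restated in this section), and the function and parameter names are hypothetical.

```python
def marginal_regression_code(x, atoms, k):
    """Sparse code of `x` from one pass of correlations with the dictionary.

    `atoms` is a list of dictionary atoms, each a list of floats assumed to
    be unit-norm. `k` (hypothetical parameter) is the number of nonzero
    coefficients to keep; top-k selection is an assumption of this sketch.
    """
    # Marginal regression coefficients: one inner product per atom.
    corr = [sum(a_i * x_i for a_i, x_i in zip(atom, x)) for atom in atoms]
    # Indices of the k coefficients with the largest magnitude.
    keep = sorted(range(len(corr)), key=lambda j: abs(corr[j]), reverse=True)[:k]
    code = [0.0] * len(corr)
    for j in keep:
        code[j] = corr[j]
    return code
```

The cost is one product with the dictionary plus a sort, with no iterative optimization, which is consistent with the scaling argument made above for large dictionaries.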

There are several ways in which the proposed approach can be extended. First, using an adaptive or non-constant 
kernel bandwidth should lead to higher accuracy. It is also interesting to explore tighter generalization error bounds 
by directly analyzing the solutions of the marginal regression iterative algorithm. Another potentially useful direction 
is to explore alternative incoherence constraints that lead to easier optimization and scaling up. 
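
One standard way to quantify how incoherent a dictionary is, offered here only as an illustration, is its mutual coherence: the largest absolute normalized inner product between two distinct atoms. The sketch below computes it. It is not the specific constraint used in this paper, and the function name is hypothetical.

```python
import math


def mutual_coherence(atoms):
    """Largest absolute normalized inner product between distinct atoms.

    `atoms` is a list of nonzero vectors (lists of floats). A value near 0
    indicates a nearly orthogonal dictionary; a value of 1 indicates two
    atoms that are parallel.
    """
    norms = [math.sqrt(sum(v * v for v in a)) for a in atoms]
    worst = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            dot = sum(p * q for p, q in zip(atoms[i], atoms[j]))
            worst = max(worst, abs(dot) / (norms[i] * norms[j]))
    return worst
```

A constraint or penalty that keeps this quantity small is one candidate for the alternative incoherence constraints mentioned above, although the pairwise computation grows quadratically with the dictionary size.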
9 Appendix 



9.1 Data set Description 

9.1.1 CMU Multi-PIE face recognition: 

The face recognition experiment was conducted on the CMU Multi-PIE dataset. The dataset is challenging due to 
the large number of subjects and is one of the standard data sets used for face recognition experiments. The data set 
contains 337 subjects across simultaneous variations in pose, expression, and illumination. We ignore the 88 subjects 
that were considered outliers in [24] and use the rest of the images for our face recognition experiments. We 
follow [53] and use the 7 frontal extreme illuminations from session one as the training set and the other 20 illuminations 
from sessions 2-4 as the test set. 

9.1.2 15 Scenes Categorization: 

We also conducted scene classification experiments on the 15-Scenes data set. This data set consists of 4485 images 
from 15 categories, with the number of images in each category ranging from 200 to 400. The categories correspond to 
scenes from various settings such as kitchen, living room, etc. Similar to the previous experiment, we extracted patches 
from the images and computed the SIFT features corresponding to the patches. 

9.1.3 Caltech-101 Data set: 

The Caltech-101 data set consists of images from 101 classes such as animals, vehicles, flowers, etc. The number of images 
per category varies from 30 to 800. Most images are of medium resolution (300 x 300). All images are used as gray-scale 
images. Following previous standard experimental settings for the Caltech-101 data set, we train on 30 images per category 
and test on the rest. Average classification accuracy normalized by class frequency is used for evaluation. 
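
One common reading of accuracy normalized by class frequency is the mean of the per-class accuracies, so that categories with many test images do not dominate the score. The sketch below follows that reading. The function name is hypothetical, and the sketch assumes a nonempty label list.

```python
from collections import defaultdict


def mean_per_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, each class weighted equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Accuracy is computed within each class first, then averaged over classes.
    return sum(correct[c] / total[c] for c in total) / len(total)
```

For example, with four test images of class "a" all predicted correctly and one image of class "b" predicted as "a", the plain accuracy is 0.8 while this class-normalized score is 0.5.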

9.1.4 Activity recognition 

The KTH action dataset consists of 6 human action classes. Each action is performed several times by 25 subjects 
and is recorded in four different scenarios. In total, the data consists of 2391 video samples. The YouTube actions 
data set has 11 action categories and is more complex and challenging [12]. It has 1168 video sequences of varied 
illumination, background, resolution, etc. We randomly and densely sample blocks (400 cuboids) of video from each data 
sample, extract HOG-3D features, and construct the video features as described above. 

9.1.5 YouTube person data set 

Similar to the experiments using the feature smoothing kernel, in this section we report results on an experiment conducted 
using the time-smoothed kernel. Specifically, we used the YouTube person data set [9] to recognize people, 
based on time-based kernel smooth sparse coding. The dataset contains 1910 sequences of 47 subjects. The architecture 
for this dataset is similar to [23]. 